Two-level checkpointing and partial verifications for linear task graphs
Authors
Abstract
Fail-stop and silent errors are unavoidable on large-scale platforms, and efficient resilience techniques must accommodate both error sources. A traditional checkpointing and rollback recovery approach can be used, with added verifications to detect silent errors. A fail-stop error causes the loss of the whole memory content, so checkpoints must be written to stable storage (e.g., an external disk). In contrast, silent errors can be handled with in-memory checkpoints, which incur a much smaller checkpoint and recovery overhead. Furthermore, recent detectors offer partial verification mechanisms, which are less costly than guaranteed verifications but do not detect all silent errors. In this paper, we show how to combine all these techniques for HPC applications whose dependence graph is a chain of tasks, and we provide a sophisticated dynamic programming algorithm that returns the optimal solution in polynomial time. Simulations demonstrate that the combined use of multi-level checkpointing and partial verifications further improves performance.
Key-words: resilience, fail-stop errors, silent errors, multi-level checkpoint, verification, dynamic programming.
Affiliations: École Normale Supérieure de Lyon; INRIA, France; University of Tennessee Knoxville, USA.
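To make the idea of a dynamic program over a chain of tasks concrete, here is a minimal sketch of a much simpler problem than the one the paper solves: a single level of disk checkpoints, fail-stop errors only, and no verifications. It is not the authors' two-level algorithm. The function names, the parameters (`ckpt`, `rec`, `down`, `lam`), and the assumptions that errors follow an exponential law of rate `lam` and that a checkpoint is always taken after the last task are all illustrative choices, not taken from the abstract.

```python
import math


def expected_segment_time(work, ckpt, rec, down, lam):
    """Expected time to run `work` followed by a checkpoint of cost `ckpt`.

    Assumes exponentially distributed fail-stop errors of rate `lam`; each
    error costs a downtime `down` plus a recovery `rec`, and errors may also
    strike during recovery (classic closed-form expression).
    """
    return math.exp(lam * rec) * (1.0 / lam + down) * math.expm1(lam * (work + ckpt))


def place_checkpoints(weights, ckpt, rec, down, lam):
    """Choose after which tasks of a chain to checkpoint.

    Returns (expected makespan, 1-based positions of the checkpoints).
    best[j] is the minimal expected time to finish tasks 1..j with a
    checkpoint taken right after task j.
    """
    n = len(weights)
    prefix = [0.0]
    for w in weights:
        prefix.append(prefix[-1] + w)

    best = [0.0] + [math.inf] * n
    prev = [0] * (n + 1)
    for j in range(1, n + 1):
        for i in range(j):
            # Assumption: restarting from the very beginning needs no recovery.
            r = rec if i > 0 else 0.0
            work = prefix[j] - prefix[i]
            t = best[i] + expected_segment_time(work, ckpt, r, down, lam)
            if t < best[j]:
                best[j], prev[j] = t, i

    # Walk back through the recorded choices to list checkpoint positions.
    positions = []
    j = n
    while j > 0:
        positions.append(j)
        j = prev[j]
    return best[n], positions[::-1]


if __name__ == "__main__":
    makespan, where = place_checkpoints([10, 10, 10], ckpt=1, rec=1, down=0, lam=0.1)
    print(makespan, where)
```

The double loop runs in quadratic time in the number of tasks. With a negligible error rate the sketch keeps only the mandatory final checkpoint, and with a high error rate it checkpoints after every task, which is the qualitative trade-off that the richer setting of the paper (two checkpoint levels plus partial and guaranteed verifications) has to balance as well.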
Similar resources
Multi-level checkpointing and silent error detection for linear workflows
We focus on High Performance Computing (HPC) workflows whose dependency graph forms a linear chain, and we extend single-level checkpointing in two important directions. Our first contribution targets silent errors, and combines in-memory checkpoints with both partial and guaranteed verifications. Our second contribution deals with multi-level checkpointing for fail-stop errors. We present sophi...
Coping with silent errors in HPC applications
This report describes a unified framework for the detection and correction of silent errors, which constitute a major threat for scientific applications at extreme scale. We first motivate the problem and explain why checkpointing must be combined with some verification mechanism. Then we introduce a general-purpose technique based upon computational patterns that periodically repeat over time. ...
The Impact of Linear Process versus Genre-Based Approach on Intermediate EFL Learners’ Accuracy in Written Task Performance
The main purpose of the present quasi-experimental study was to investigate the effects of a linear process approach versus a genre-based approach on EFL learners’ written production. To this end, 40 intermediate-level learners of English were randomly selected as participants and assigned to two experimental groups (process and genre), which received different types of instruction f...
Stability Assessment Metamorphic Approach (SAMA) for Effective Scheduling based on Fault Tolerance in Computational Grid
Grid computing allows coordinated and controlled resource sharing and problem solving in multi-institutional, dynamic virtual organizations. Fault tolerance and task scheduling are important issues for large-scale computational grids because of the unreliable nature of grid resources. A commonly exploited technique to realize fault tolerance is periodic checkpointing, which periodically ...
Efficient checkpoint/verification patterns for silent error detection
Resilience has become a critical problem for high performance computing. Checkpointing protocols are often used for error recovery after fail-stop failures. However, silent errors cannot be ignored, and their particularity is that they are identified only when the corrupted data is activated. To cope with silent errors, we need a verification mechanism to check whether the application ...